A deep learning framework for epileptic seizure detection based on neonatal EEG signals

Electroencephalogram (EEG) is one of the main diagnostic tests for epilepsy. The detection of epileptic activity is usually performed by a human expert and is based on finding specific patterns in the multi-channel electroencephalogram. This is a difficult and time-consuming task, so various attempts have been made to automate it using both conventional and Deep Learning (DL) techniques. Unfortunately, authors often do not provide sufficiently detailed and complete information for their results to be reproduced. Our work is intended to fill this gap. Using a carefully selected set of 79 neonatal EEG recordings, we developed a complete framework for seizure detection using a DL approach. We share ready-to-use R and Python code that allows the user to: (a) read raw European Data Format (EDF) files, (b) read data files containing the seizure annotations made by human experts, (c) extract training, validation and test data, (d) create an appropriate Convolutional Neural Network (CNN) model, (e) train the model, (f) check the quality of the neural classifier, (g) save all learning results.

Main contributions of the paper. The main contributions of this paper can be listed as follows:
1. We have proposed a DL framework based on a CNN for detecting seizure activity and tested its usability on a real neonatal EEG dataset.
2. We have proposed a sliding window design to generate fully balanced training data. The design can greatly increase the amount of data which is then fed to the neural network. This can be seen as a kind of data augmentation, and it is crucial for CNNs, which typically require large amounts of data to operate effectively and produce useful results.
3. We have developed a solution for reading raw EDF files and annotation files with seizure indications made by human experts. Based on these data, a training dataset for the CNN is generated and saved in HDF5 format. This part was programmed in the R environment and is shared as ready-to-use R scripts.
4. We have developed a CNN model which can be successfully trained to detect seizure episodes. The obtained classification accuracy (96-97%) should be considered very good. This part was programmed in Python and is shared as a ready-to-use Jupyter notebook.
5. We have made it our priority to ensure that all the presented results are fully reproducible by other researchers. Therefore, all the source code as well as all the output results obtained by the authors have been included in the Supplementary Information files, together with detailed instructions on how to reproduce them. We consider this point particularly important. To quote a very extensive review 14: "... the great majority of papers did not make their code available. Many papers reviewed are thus more difficult to reproduce: the data is not available, the code has not been shared, and the baseline models that were used to compare the performances of the models are either nonexistent or not available."
6. The study will also help readers to analyze their own EEG datasets with only minor modifications to our R and Python code (adjusting it to possible differences in the EEG data used and in the way seizures are annotated).
The overall workflow of the proposed system, schematically depicted in Fig. 1, is decomposed into 4 main phases: (1) preprocessing of the raw EEG recordings and annotation files, (2) building the CNN model, (3) training the CNN model, (4) generating the final classification results. The preprocessing stage loads the input data (raw EDF and annotation files) and converts it to a format that can be submitted to the CNN model. This step has been implemented in R version 4.1.2 17 . Building the CNN model, training it and generating all the results have been implemented in TensorFlow version 2.8.0 18 and delivered as a Python Jupyter notebook 19 .

Methods
Cohort. The study was conducted on a carefully selected dataset of 79 neonatal EEG recordings. The neonates were admitted to the neonatal intensive care unit (NICU) at the Helsinki University Hospital between 2010 and 2014. The cohort is described in detail in 5 ; please refer to the source text, which also includes the relevant ethics approval. All experiments were performed in accordance with the relevant guidelines and regulations.
The annotation files are sampled with one-second resolution. The detailed structure of these files is described in 5 . Since reading this data directly from CSV or MAT files is quite inconvenient, we have collected basic quantitative data on seizures and included them in two tables. Table 6 shows how many seizures were annotated for each infant by each of the three experts. Note that 40 neonates had seizures annotated by all 3 experts, 17 neonates had seizures annotated by 1 or 2 experts, and 22 neonates were seizure free. The experts are marked as A, B and C. Table 7 shows a complete list of the lengths of seizures annotated by the 3 experts (in whole seconds). The total number of seizures is 1,379, which is the same as shown in the last lines of Table 6. Tables 6 and 7 are very long, but the authors decided to include them in their entirety, as obtaining this data from the CSV files manually would be very time consuming; in practice, appropriate software is essential here. An additional summary of the annotations is provided in Table 8.
Let us note here that in many cases there is a discrepancy between the annotations of individual experts. For example, for infant number 41, experts A and C indicated significantly more seizures than expert B. The lengths of individual seizures also very often vary between experts. Such a lack of consensus among experts is quite natural in the field of EEG signal analysis 4 .
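The per-second annotation format lends itself to a simple run-length scan. The following minimal sketch (in Python, not the authors' R implementation) shows how seizure counts and lengths of the kind reported in Tables 6 and 7 could be derived from one expert's 0/1 annotation vector for one infant. The function name seizure_events and the toy vectors are illustrative assumptions, not data from the study.

```python
def seizure_events(per_second):
    """Return (start, end) pairs, in seconds and inclusive, for every run of 1s."""
    events, start = [], None
    for t, flag in enumerate(per_second):
        if flag == 1 and start is None:
            start = t                      # a seizure begins at second t
        elif flag != 1 and start is not None:
            events.append((start, t - 1))  # the seizure ended one second earlier
            start = None
    if start is not None:                  # seizure still running at the end of the recording
        events.append((start, len(per_second) - 1))
    return events


# Toy annotations (invented for illustration): two experts, same 20 s recording.
expert_a = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
expert_b = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for name, annotation in [("A", expert_a), ("B", expert_b)]:
    events = seizure_events(annotation)
    lengths = [end - start + 1 for start, end in events]
    print(name, len(events), lengths)
```

In this toy case the two experts disagree both on the number of seizures and on their lengths, which mirrors the kind of discrepancy discussed above.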
Software. The raw EEG recordings were preprocessed (read and saved in Hierarchical Data Format (HDF5)) using R version 4.1.2 17 . The HDF5 format was chosen because it is well suited to storing and organizing large amounts of data.
In our research the Keras DL library was used to develop the CNN model 22 . It is also worth noting that Keras is a wrapper around the TensorFlow framework 18 . Keras was adopted and integrated into TensorFlow in mid-2017, and users can access it via the tf.keras module. TensorFlow, in turn, is an open-source DL framework developed by Google and released in 2015. Typically, one defines a model with the Keras interface, which is easier to use, and drops down into TensorFlow only when a feature is needed that Keras does not have or when a specific TensorFlow functionality is required.
Due to the considerable computing power required, our code was run in the Colaboratory cloud service hosted by Google 23 , where fast GPUs are available (https://colab.research.google.com). The Google Colaboratory service allows users to write and run Python code directly in a web browser, which is an extremely convenient solution. Similar functionality is offered by the Kaggle service (https://www.kaggle.com/).
Data preprocessing. This section describes in detail how to prepare datasets for further analysis. This is a very important issue: if not done properly, it may affect the final results of EEG signal classification. Unfortunately, in many papers the authors omit a more detailed description of this stage. We would like to fill this gap here. The process involves several steps, as described below.
• Step 1. Selection of EEG recordings: The data is analyzed separately for each expert (A, B or C). We are dealing here with a binary classification (seizure / non-seizure). Therefore, it is necessary to select from the available EEG signals those that were assessed by the experts as seizures and those assessed as seizure free. The neonates with seizures annotated by all three experts form the subset EXP3, and the seizure-free neonates form the subset EXP0. The remaining neonates had seizures annotated by only one or two experts. We mark this subset as EXP12. Due to the ambiguity in the expert opinion this subset was excluded from the analysis. Table 6 summarises the three subsets.
• Step 2. Bipolar montage: In the next step the bipolar montage was generated as described in the "Input data" section. At this point, it should be noted that the order of signals differs between individual EDF files, so they must always be arranged in the same order. It is a small but very important part of data preprocessing. For example, in the EDF file of infant No. 1 the order of raw signals is Fp1, Fp2, F3, F4, C3, C4, P3, P4, O1, O2, F7, F8, T3, T4, T5, T6, Fz, Cz, Pz, while in the file of infant No. 2 the order is Fp1, Fp2, F3, F4, F7, F8, Fz, C3, C4, Cz, T3, T5, T4, T6, P3, P4, Pz, O1, O2. This step has been implemented in R.
• Step 3. Down-sampling: The sampling frequency of the EEG recordings was 256 Hz. For analyses using neural networks, this frequency is too high and unnecessarily increases the size of the input data (already quite large). Therefore, the data is down-sampled. After performing various experiments, the authors concluded that the optimal down-sampling coefficient is 4. This means that signals with a frequency of 64 Hz are fed to the input of the neural network. Reducing the frequency can, in a sense, be treated as a form of data smoothing. Figure 3 shows two fragments of EEG recordings, each 3 seconds long. In the upper figure, the signal frequency is 256 Hz and in the lower figure it is down-sampled to 64 Hz. The aforementioned smoothing effect is clearly visible. Down-sampling has been implemented in R.
• Step 4. Sliding window design: From Table 6 one can calculate that an average of about 460 seizures were annotated per expert in the EEG dataset. This number is definitely too small to train neural networks effectively (especially convolutional neural networks). Therefore, we used a sliding window technique to increase the amount of data which is then fed to the neural network.
The second important task of the proposed sliding window design is to select a balanced number of seizure and non-seizure chunks. The process of preparing training data for the CNN consists of two steps: (a) selection of positive and (b) selection of negative samples from all recorded EEG signals. A positive sample is a chunk (fragment) with an annotated seizure; a negative sample is a seizure-free chunk. The design is illustrated in Figs. 4, 5 and 6. In all three figures the F3-C3 channel of infant #1 is depicted (arbitrarily selected by the authors). The top panel shows two EEG signals with seizures annotated by expert A. The first seizure begins at the 104th second and ends at the 121st second; the second begins at the 6847th second and ends at the 6863rd second (see Table 7). The bottom panel shows the F3-C3 channel of infant #10, which is seizure free, with a randomly selected appropriate number of chunks (5, 4 and 10, respectively). We have two parameters at our disposal (window and chunks). Using them, we can define what the resulting data samples will look like. In Fig. 4 we set window=6 and chunks=3. This means that we want to choose 3 chunks from every annotated seizure, each 6 seconds long. Note that the second seizure is 17 seconds long, so it is actually possible to select only 2 and not 3 chunks (otherwise, we would fall into a non-seizure area). The first seizure is 18 seconds long, so it is possible to select 3 chunks. In Fig. 5 we set window=5 and chunks=2, and now the lengths of both seizures allow 2 chunks to be selected. In Fig. 6 the window size is set to 2 and the desired number of chunks is 5.

Next, we need to select a relevant number of seizure-free chunks. Binary classification (seizure/non-seizure) requires that the dataset is well balanced, which means that the numbers of seizure and non-seizure samples should be more or less the same. The non-seizure chunks are randomly selected from non-seizure EEG signals (bottom panels in Figs. 4, 5 and 6, patients in the group EXP0, see Table 6). As a result, there is no danger that the two subsets will be unbalanced. The total number of non-seizure chunks is 5, 4 and 10, respectively, in our examples. The window and chunks parameters can of course be set to any integer values, according to the user's needs.

Figure 4. The top signal has 2 seizures annotated by expert A. The first one starts at the 104th second, ends at the 121st second and is 18 seconds long. The second one starts at the 6847th second, ends at the 6863rd second and is 17 seconds long (see Table 7). By setting the appropriate values for the window and chunks variables, we can control the length of the samples (window variable) and their total number (chunks variable). The window length was set to 6 seconds and the number of chunks was set to 3. Note that the length of the second seizure fragment is 17 seconds. Consequently, it is possible to select only two chunks from the second seizure (although we assumed that we are selecting 3 chunks). From the first seizure one can safely get 3 chunks. The bottom EEG signal has no seizures annotated. We randomly select the same number of chunks (i.e. 5) as we have selected from the top EEG signal. Thanks to this method of selecting chunks, the number of seizure and non-seizure chunks is well balanced. The starting and ending seconds were chosen randomly (for example from 44 to 49 etc.).

Figure 6. See Fig. 4; the EEG signals are the same. The figures differ in that the window and chunks parameters have different values. Here the window length was set to 2 and 5 chunks were selected from each of the two seizure fragments, see top picture (a). Consequently, 10 chunks were selected from the non-seizure signal, see bottom picture (b).
The above-described method of selecting windows and chunks has been implemented in R. Note also that there are studies in which the authors propose methods that allow for the effective detection of epileptic seizures in imbalanced EEG recordings 24,25 . However, our solution based on the CNN approach requires that the data be fully balanced, hence we use the sliding window design described above. Our design, by definition, guarantees the generation of a fully balanced dataset. If unbalanced data were fed to the CNN, the obtained results (binary classification: seizure / non-seizure) would be less reliable and less accurate.
We also point out a subtle difference. In the field of EEG signal analysis, the term epoch is used. EEG epoching is a procedure in which specific time windows are extracted from the continuous EEG signal. In our approach we use the term window rather than epoch to emphasize a slightly different meaning. We do not divide the entire EEG signal into epochs, but select only the fragments that interest us, which we call windows (see Figs. 4, 5 and 6 for an explanation). Because CNNs require large amounts of data to function properly (mainly to reduce the phenomenon known as overfitting), we also introduce the concept of chunks, which allows us to increase the amount of training data at our disposal. Let us also mention that the concept of chunks is somewhat similar to the commonly used data augmentation, a powerful technique for mitigating overfitting in computer vision. Note also that some authors propose an epoch reduction approach for better model accuracy 26,27 , but in our case this technique is not applicable.
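The chunk selection described in Step 4 can be sketched as follows. This is a minimal Python illustration, not the authors' R implementation. It assumes that chunks are contiguous, non-overlapping and start at the seizure onset, and that negative chunks are drawn at uniformly random start seconds; the function names and the 7200-second duration of the seizure-free recording are hypothetical.

```python
import random


def seizure_chunks(start, end, window, chunks):
    """Return up to `chunks` contiguous (first, last) second ranges of `window`
    seconds that fit entirely inside the annotated seizure [start, end]."""
    length = end - start + 1              # annotations are inclusive, 1 s resolution
    n = min(chunks, length // window)     # never run past the end of the seizure
    return [(start + i * window, start + (i + 1) * window - 1) for i in range(n)]


def non_seizure_chunks(duration, window, n, rng):
    """Pick `n` random windows of `window` seconds from a seizure-free recording."""
    starts = [rng.randrange(0, duration - window + 1) for _ in range(n)]
    return [(s, s + window - 1) for s in starts]


seizures = [(104, 121), (6847, 6863)]     # infant 1, expert A (values from the text)
rng = random.Random(0)                    # fixed seed only to make the sketch repeatable
counts = {}
for window, chunks in [(6, 3), (5, 2), (2, 5)]:
    positive = [c for s, e in seizures for c in seizure_chunks(s, e, window, chunks)]
    negative = non_seizure_chunks(7200, window, len(positive), rng)
    counts[(window, chunks)] = (len(positive), len(negative))
    print(window, chunks, len(positive), len(negative))
```

Because the number of negative chunks is always taken from the number of positive chunks actually obtained, the two classes stay balanced for any choice of window and chunks, including a very large chunks value that simply takes every chunk that fits.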
• Step 5. Saving data in HDF5 format: After completing all the above steps, we obtain the final matrix in which fragments with and without seizures are present. For the case shown in Fig. 4, the size of the matrix will be 19 × 10, see the illustrative Fig. 8 (the last row is the class indicator: 1 means seizure, 0 means non-seizure). Note also that all 18 channels are analyzed simultaneously. The data in this form is then saved in HDF5 format, which is very convenient for storing large files of numeric data in an efficient binary format. The saved HDF5 files are passed as input to the Python routines that implement CNN learning. Note that in reality the matrices generated from real EDF files will be much bigger. After down-sampling, every second of our EEG dataset is represented by 64 datapoints (see Step 3 above). Therefore, the matrix for the data schematically depicted in Fig. 4 would be 19 × 3840 (10 chunks × 6 seconds × 64 datapoints). Moreover, when working with real data, matrices will be many times larger, since multiple seizures are marked in EDF files and EEG recordings are often longer than 20 seconds (unlike those shown in Figs. 4, 5 and 6). Additionally, the annotated seizures last for many seconds (see Table 7) and the window and chunks parameters can take values greater than those in the toy example shown above.
Deep learning CNN architecture. The CNN model used in our research has the structure shown schematically in Fig. 7, produced with the summary function implemented in Keras. Its structure is the result of many experiments and tests aimed at developing the best possible architecture. A summary of the most important elements of the CNN architecture is given in Table 1.
Input data format for CNN. The data stored in the form of the two-dimensional matrix shown in Fig. 8 cannot be used directly as input for the CNN implemented in Keras. It must first be rearranged into tensor form. A tensor is simply a generalization of the concept of a vector or matrix. Only in this form can the data be used in the CNN. For those interested, we recommend a very clearly written book 15 . Details of the rearranged matrix are shown in Fig. 9. A tensor of size 10 × 384 × 18 and a vector of size 10 × 1 are created. The rearranging has been implemented in Python. Looking at the tensor, it is easy to see how the individual chunks are organized. In Fig. 10 four randomly chosen pairs of seizure/non-seizure chunks are depicted. The individual EEG signal values are represented as colormaps. It is easy to see that the analysis of EEG signals in the form of time series in effect becomes the analysis of two-dimensional images. The upper plots show seizure chunks and the lower ones show non-seizure chunks. A certain pattern can be seen in Fig. 10a,b, where the top drawings appear more blurry. In Fig. 10c,d, however, the human eye cannot see any clear differences. Nevertheless, the very good classification results obtained with the CNN approach show (not for the first time) that deep neural networks are able to solve seemingly unsolvable tasks.
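The rearrangement from the two-dimensional matrix to the tensor and label vector can be illustrated with plain Python lists. This is a sketch under stated assumptions rather than the authors' notebook code: it assumes that the chunks are laid out side by side along the columns and that the class indicator in the last row is constant within a chunk. The function name to_tensor and the toy matrix values are hypothetical.

```python
FS_RAW = 256                 # original sampling frequency in Hz
FACTOR = 4                   # down-sampling coefficient from Step 3
FS = FS_RAW // FACTOR        # 64 Hz after down-sampling
WINDOW = 6                   # seconds per chunk, as in Fig. 4
N_CHUNKS = 10                # 5 seizure + 5 non-seizure chunks
N_CHANNELS = 18
SAMPLES = WINDOW * FS        # 384 datapoints per chunk


def to_tensor(matrix, samples_per_chunk):
    """Split a (channels + 1) x (chunks * samples) matrix into a
    chunks x samples x channels tensor and a vector of class labels."""
    *signals, label_row = matrix
    n_chunks = len(label_row) // samples_per_chunk
    tensor = [
        [[row[k * samples_per_chunk + t] for row in signals]
         for t in range(samples_per_chunk)]
        for k in range(n_chunks)
    ]
    # Assumption: the class indicator is constant within a chunk,
    # so its first datapoint is enough.
    labels = [label_row[k * samples_per_chunk] for k in range(n_chunks)]
    return tensor, labels


# Toy matrix: each value encodes (channel, column) so the layout is easy to check.
n_cols = N_CHUNKS * SAMPLES
matrix = [[ch * n_cols + col for col in range(n_cols)] for ch in range(N_CHANNELS)]
matrix.append([1 if col // SAMPLES < 5 else 0 for col in range(n_cols)])

tensor, labels = to_tensor(matrix, SAMPLES)
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
print(shape, labels)
```

With an array library such as NumPy the same rearrangement would typically be expressed as a reshape followed by a transpose; the explicit loops are used here only to make the index mapping visible.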
Training, validation and test data. For the CNN to work correctly, it is necessary to split the data into three parts: (a) training, (b) validation and (c) test. The model is trained on the training data and its accuracy is continuously checked using the validation data. Once the model is trained, it is tested on the test data. The test set is not involved in the process of building and tuning the model. This is the basic principle that guarantees the objectivity of the obtained results. The splitting of data into training and validation sets is usually done randomly. The result of the validation stage will therefore depend on which elements of the dataset are used during validation and which during training, so a single validation result will not be reliable. The best practice in such situations is to use K-fold cross-validation. It is based on splitting the available data into K folds (see Fig. 11), creating K identical models and training each of them on K-1 folds. Each model is evaluated on the remaining fold. The final validation score is the average of the K validation scores obtained. In our implementation the training and validation subset contains 80% of all data and the test subset contains 20%. The K parameter was set to 5.
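The splitting scheme described above can be sketched with index lists only. The snippet is a minimal illustration in pure Python, not the code from the authors' notebook; the function names, the seed and the toy size of 100 samples are assumptions made for the example.

```python
import random


def split_train_test(indices, test_fraction, rng):
    """Shuffle sample indices and hold out a test part that is never used for tuning."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(indices, k):
    """Yield (train, validation) index lists; each fold is the validation part exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


rng = random.Random(42)                   # fixed seed only to make the sketch repeatable
train_val, test = split_train_test(range(100), 0.20, rng)

fold_sizes = []
for train, validation in k_fold(train_val, k=5):
    # A fresh model would be trained on `train` and scored on `validation` here;
    # the final validation score is the mean of the K scores.
    fold_sizes.append((len(train), len(validation)))

print(len(train_val), len(test), fold_sizes)
```

Keeping the test indices out of the fold loop is what guarantees that the final evaluation is performed on samples the models have never seen.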
The input data for CNN are in the form of tensors and vectors, as shown in Fig. 9. Tensors contain both positive and negative samples and every sample is basically treated completely independently. This is in line with the neural networks principles: the neural network should be provided with as much training data as possible without making any assumptions about the relationship between the training samples. In other words, we do not make any additional assumptions about the acceptance or rejection of a given sample. Each of them is treated the same and it does not matter which patient it comes from.
As an example, please refer to Table 6. We can see that expert A annotated 385 seizures in the set named EXP3. Using our sliding window design, let us assume that window=2 and chunks=2. We then obtain 385 × 2 × 2 = 1540 positive samples in total that will go to the input of the CNN. Consequently, all available seizure signals are used and none is left out. Because neural networks work best when the training data is balanced, in the next step we select the same number of negative samples. To make the data fully balanced, we select exactly 1540 negative chunks, each with a length of 2 seconds. The negative samples are taken from the patients marked as EXP0 (i.e. without any annotated seizures). Consequently, our CNN receives 3080 samples. This set is then randomly split into the training-validation part and the test part (80% vs. 20%, i.e. 2464 vs. 616 samples). Finally, the training and validation subset is randomly split according to the cross-validation methodology, as visualized in Fig. 11. For K = 5, in every fold about 2464 × 4/5 ≈ 1970 samples are used for training and about 2464 × 1/5 ≈ 494 samples are used for validation. The remaining 616 samples are used to evaluate the accuracy of the trained CNN models. The results are summarised in Tables 3, 4 and 5.
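The sample counts in this example can be retraced with a few lines of integer arithmetic. The sketch below starts from the 1540 positive samples quoted in the text and assumes that the folds are made as equal as possible; the exact per-fold sizes produced by the authors' notebook may differ slightly from this illustration.

```python
positive = 1540                  # positive samples quoted in the text (expert A, EXP3)
negative = positive              # the design selects the same number of negative chunks
total = positive + negative

test = total * 20 // 100         # 20 % held out for the final evaluation
train_val = total - test         # 80 % for training and validation

K = 5
base, extra = divmod(train_val, K)
validation_sizes = [base + 1 if i < extra else base for i in range(K)]
training_sizes = [train_val - v for v in validation_sizes]

print(total, train_val, test)
print(validation_sizes)
print(training_sizes)
```

Since 2464 is not divisible by 5, an exact split yields folds of 492 or 493 validation samples, which is consistent with the rounded per-fold figures quoted above.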

Results
The seizure annotations presented in 5 are shared in a specific non-standard format. Therefore, we first developed software that allows one to easily load this data and, on the basis of it, prepare batches that can be used as input data for the CNN. This part of the software was implemented in R. The data generated have the structure shown in Fig. 8. In our experiments we chose the following values for the window and chunks variables: window=[1,2,5,10,20] and chunks=[1,2,5,10,20,10000]. The value 10000 means that the maximum possible set of contiguous chunks was selected; this is safe because our dataset simply does not contain seizures as long as 10,000 seconds. Using these values, 30 different datasets were generated for the annotations prepared by each of the experts A, B and C. This makes a total of 90 different datasets saved as HDF5 files; see the "Replication of the results" section for a detailed explanation of how to generate these files, how and where they are stored and what their naming convention is.
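The parameter grid can be enumerated programmatically. The sketch below builds the 5 × 6 grid for each expert and composes file names following the naming convention given in the "Replication of the results" section (for example expert_C_5sec_2chunk_64Hz.hdf5). How the value chunks=10000 appears in the actual file names is an assumption here.

```python
from itertools import product

EXPERTS = ["A", "B", "C"]
WINDOWS = [1, 2, 5, 10, 20]               # window lengths in seconds
CHUNKS = [1, 2, 5, 10, 20, 10000]         # 10000 stands for "as many chunks as fit"

grid = list(product(WINDOWS, CHUNKS))
datasets = [
    # Assumption: the chunks value is written verbatim, also for 10000.
    f"expert_{expert}_{window}sec_{chunks}chunk_64Hz.hdf5"
    for expert in EXPERTS
    for window, chunks in grid
]

print(len(grid), len(datasets))
print(datasets[0], datasets[-1])
```

Enumerating the grid this way makes it easy to check that no combination is missing before a long training run is started.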
The CNN learning results are collected in Tables 3, 4 and 5. The best test-set accuracy, the longest computation time and the largest data size in chunks are printed in bold. The obtained classification accuracy (96-97%) should be considered very good, almost perfect. It should be emphasized, however, that large amounts of training data are required to obtain such results. This is the main reason why the sliding window design was developed and implemented.
It is worth noting that the learning process of CNNs is not deterministic. This means that, in principle, we are not able to obtain exactly the same results by performing the same calculations again. Each time the results will be slightly different, although the differences will not be large. All calculations are performed 5 times (fivefold cross-validation scheme) and then the average of all partial results is calculated. Tables 3, 4 and 5 show these average results. All the partial results are included in the Electronic Supplements (in the results directory, see the directory structure in the "Replication of the results" section).
In the tables we also show the average computation time (rounded to full minutes). These values should be treated with caution, since a lot depends on the type of GPU and its current load. We worked in the Google Colaboratory and Kaggle cloud environments, where shared resources can vary quickly over time.
Replication of the results. In this section, we provide details that will help one to replicate all the results of our numerical experiments. We would also like to point out that in some places the source code contains hard-coded elements related to the specifics of the source data used. These are mainly: (a) EDF file names, (b) the number and names of channels stored in EDF files, (c) the coding system of seizure annotations. If the provided code were to be used in the future to analyze a different set of EEG signals, minor changes would have to be made. The authors are willing to provide the necessary help to interested researchers.
The overall workflow to reproduce the results obtained by the authors can be summarised in 7 steps, which are shown schematically in Fig. 12. Upload the 79 EDF files and 3 CSV files which you downloaded in Step 1 to the edf and annotations directories. In the acc_loss, best_models, hists, logs, results, ROC and waveforms directories we have stored all our output results. However, you can regenerate these results yourself by running the appropriate R and Python scripts, see the next steps below. The complete directory structure is given below and a short description of the content of the individual working subdirectories is given in Table 2.

Table 2. Content of the individual working subdirectories.
hists: Stores the models' training and validation accuracy and loss values. This data allows you to prepare visualizations of network training, similar to those depicted in Figs. 13 and 14. The data is saved in the PCKL format (implemented in Python's pickle module) and as CSV text files. An example of how to use these files is shown in the enclosed Jupyter notebook (the pickle.load function).
inputs: Stores HDF5 files which are inputs for our CNN model. These files are created in R (EEG_neonatal.R script) using the raw EDF files which are stored in the edf directory. To find out exactly which fragments of the original raw EDF files were used in the HDF5 files (i.e. the exact sample numbers), files with names beginning with non_seizures_ and seizures_ are additionally generated.
logs: Stores log files to be parsed by TensorBoard (TensorBoard is a tool for providing the measurements and visualizations needed during the machine learning workflow).
results: Stores the partial results for every fold; the values in Tables 3, 4 and 5 are the average of the K = 5 validation scores obtained using the K-fold validation scheme. Additionally, execution times for every fold and GPU card types are given.
ROC: Stores ROC curves along with the AUC metrics.
waveforms: Stores all EEG seizure waveforms annotated by the 3 experts. There are 1379 waveforms in total, as shown in Table 6. The lengths of the waveforms were arbitrarily set to 10 seconds. However, the user can generate waveforms with different lengths by running the generate_eeg_waveforms function in R. See the Supplementary Information files for details on how to do this.

Figure 8. After selecting the desired number of chunks in Fig. 4 one must combine them in a matrix form. In this example the matrix has 18+1 rows (the last row is a class indicator) and 10 columns. Every single cell represents a 6 seconds long EEG signal.

Figure 9. The two-dimensional matrix shown in Fig. 8 cannot be fed into the neural network in this form. In Keras a 3D tensor is required. The figure shows how the 2D matrix must be divided into a tensor and a vector with seizure indicators.

Table 3. Evaluation results for the dataset based on annotations given by expert A. Evaluation was performed on the test set using the fivefold cross-validation scheme (see Fig. 11). Three values are given for every window size and every number of contiguous chunks: (a) the test-set accuracy in %, (b) the average computation time for five folds (see Fig. 11) rounded to full minutes, (c) the total number of chunks (see the tensor in Fig. 9). The given computation times should be treated as indicative, as they depend strongly on the instantaneous load of the Colab system used. 10,000 means that the maximum possible set of contiguous chunks was selected; this is safe because our dataset does not contain seizures as long as 10,000 seconds.

Table 4. Evaluation results for the dataset based on annotations given by expert B. The rest of the caption is identical to that of Table 3.

Make sure that the current working directory is R. Also set the dir variable to the one indicating the appropriate directory structure on your local computer. The parameters of the generate_samples() function can be changed depending on your actual needs. Those that are saved in the EEG_neonatal.R script will generate exactly the same HDF5 files that were included in the Electronic Supplements. After generating [...] HDF5 file data but you do not need to use them). The data files have the same logical structure as in Fig. 9 and use a uniform naming convention.
For example, the file name expert_C_5sec_2chunk_64Hz.hdf5 means that the data was generated according to the annotations made by expert C, the window size was set to 5 seconds and the number of contiguous chunks was set to 2 (see Figs. 4 and 5). A similar naming convention was used for all other files in the working subdirectories. Note: we do not put the HDF5 files in the regular Electronic Supplements, as their total size is about 16.6 GB. However, for your convenience, we included them in separate zip archives, see the "Data and code availability" section.
• Step 6. CNN processing: Open the EEG_neonatal.ipynb Jupyter notebook in your favourite Python 3 environment, local or cloud-based. Before the script is run, two global variables, namely WRK_DIR and INPUT_DIR, should be set, indicating the appropriate directories for your runtime environment. Leave the other global variables unchanged if you use input data provided by the authors (i.e. HDF5 files in the working/inputs directory).
The calculation results will be saved in the subdirectories of the working directory. The files share the naming convention described above. For example, the file best_model_expert_A_1sec_1chunk_64Hz_fold_0.h5 stores the best model obtained during training of the neural network using the input file expert_A_1sec_1chunk_64Hz.hdf5 during the first fold (fold_0, see Fig. 11; we start counting folds from 0 according to the Python convention). To get the complete results presented in the paper, in the __Run__ block set the following values: [...]
We clearly point out that the calculations will take several days in total; training CNNs unfortunately requires enormous computing power. The computation time can be reduced fivefold, but at the cost of abandoning the K-fold scheme. To do so, set COMPLETE_CALCULATIONS=False in the Global variables block. However, the results obtained will be somewhat less objective.
• Step 7. Inspecting final results: All final results are stored in the individual subdirectories of the working directory. These are: (a) confusion matrices, (b) accuracy, precision, recall and F-measure metrics, (c) CNN training accuracy and loss metrics as well as the corresponding plots, (d) ROC curves.

Table 5. Evaluation results for the dataset based on annotations given by expert C. The rest of the caption is identical to that of Table 3.
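The naming convention used in the replication steps above can also be decoded programmatically, which is convenient when collecting results from many files. The following sketch uses a regular expression that matches the two file names quoted in the text; the pattern and the function name parse_name are illustrative assumptions and may need adjusting for other files in the working subdirectories.

```python
import re

PATTERN = re.compile(
    r"^(?:best_model_)?"              # optional prefix of saved models
    r"expert_(?P<expert>[ABC])_"
    r"(?P<window>\d+)sec_"
    r"(?P<chunks>\d+)chunk_"
    r"(?P<rate>\d+)Hz"
    r"(?:_fold_(?P<fold>\d+))?"       # optional fold number, counted from 0
    r"\.(?:hdf5|h5)$"
)


def parse_name(filename):
    """Decode expert, window, chunks, sampling rate and (optional) fold from a file name."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    fields = match.groupdict()
    return {
        "expert": fields["expert"],
        "window": int(fields["window"]),
        "chunks": int(fields["chunks"]),
        "rate_hz": int(fields["rate"]),
        "fold": None if fields["fold"] is None else int(fields["fold"]),
    }


print(parse_name("expert_C_5sec_2chunk_64Hz.hdf5"))
print(parse_name("best_model_expert_A_1sec_1chunk_64Hz_fold_0.h5"))
```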
We used a sequential CNN model with techniques such as dropout, max-pooling, batch normalization and L2 regularizers. It is important to note that we have developed a fairly flexible method of selecting the number of training samples (through the chunks parameter) and the length of individual samples (in seconds, through the window parameter). The user can thus very easily generate training data with the desired characteristics.
Our research confirms that deep neural networks must be provided with a sufficient amount of training data in order to perform their task well. The results presented in Tables 3, 4 and 5 show that the total number of training samples is not as important as the length of the individual samples (window parameter). The value window=5 appears to be optimal; increasing it further brings little improvement. As for the chunks parameter, in general the higher its value, the better the results. However, keep in mind that the training time of the neural network then increases very quickly. The value chunks=20 gives quite good results.
In Fig. 13 we show an example of CNN training and validation accuracy (upper curves), as well as the training and validation loss (lower curves). The dataset was created on the basis of annotations made by expert B with the following parameters: window=5 and chunks=10000. In the context of training CNNs, these curves can be considered almost ideal: accuracy is almost 1, loss is almost 0 and there is no sign of overfitting. Note also that in this example the input dataset is large enough (23,979 samples, see Table 4) that this unfavorable phenomenon does not occur. If, on the other hand, the CNN receives too little training data (expert B, window=1 and chunks=1, see Fig. 14), overfitting occurs very quickly, in our example around the 50th epoch.
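The onset of overfitting described above can also be located programmatically from the per-epoch validation loss. The sketch below is a simple heuristic, assuming the validation losses are available as a plain list; the function name overfitting_onset, the patience value and the example numbers are hypothetical and are not part of the authors' notebook.

```python
def overfitting_onset(val_loss, patience=5):
    """Return the epoch of the best validation loss once it has not
    improved for `patience` consecutive epochs, or None otherwise.

    val_loss: validation loss per epoch, epochs counted from 0.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_loss):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: report the best epoch.
            return best_epoch
    return None


# Hypothetical curve: the loss falls until epoch 3 and then rises again.
history = [0.9, 0.7, 0.6, 0.55, 0.6, 0.65, 0.7, 0.8]
onset = overfitting_onset(history, patience=3)
```

A steadily decreasing validation loss, as in the well-behaved case of Fig. 13, yields None under this heuristic.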
Notes on using the framework to analyze datasets other than those used in the article. In our study, we used a neonatal EEG dataset, which is quite specific. Nevertheless, the proposed framework
Table 7. Lengths (in whole seconds) of seizures for every infant annotated by 3 experts (marked as A, B and C). When a given expert did not mark any seizures for a given infant, it was marked with a hyphen (-).
In the first case the main requirement is that seizure annotations be in the same specific (in fact non-standard) format as our data. The annotations must be stored in a CSV file where each column corresponds to a subject (patient) and each row is the annotation of one second of the EEG recording (1 for seizure and 0 for non-seizure). Please study the 3 files in the annotations directory to better understand the file structure. As for raw EDF files, please note that they may have a slightly different structure (different number of channels, different channel names, etc.). To use our code with other EDF datasets, the following requirements must be met (see also the Electronic Supplements for more information).
• EDF files must be readable by the read.edf() function (edf R package).
• We assume that EDF file names consist of the prefix eeg followed by consecutive subject numbers, e.g. eeg1.edf, eeg2.edf, etc. Otherwise, some minor changes are required in the generate_samples() function.
• The EEG channel names are hard-coded in the function generate_montage(). This function must be adapted to the structure of your raw EDF files.
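The annotation layout described earlier (one column per subject, one row per second, 1 for seizure and 0 for non-seizure) can be checked with a few lines of Python before running the R scripts. The sketch below derives the length of each seizure in whole seconds, in the spirit of Table 7. The function name seizure_lengths is hypothetical, and two details are assumptions that should be verified against the files in the annotations directory: that the first row is a header with subject identifiers, and that shorter recordings are padded with empty cells.

```python
import csv
import io
from itertools import groupby


def seizure_lengths(csv_text):
    """Map each subject (column) to the lengths of its seizure runs.

    Assumes a header row of subject identifiers and empty cells as
    padding for shorter recordings (both are assumptions).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lengths = {}
    for col, subject in enumerate(header):
        # Per-second labels for this subject, skipping padding cells.
        labels = [row[col].strip() for row in data
                  if col < len(row) and row[col].strip()]
        # Each run of consecutive "1" labels is one seizure event.
        lengths[subject] = [sum(1 for _ in run)
                            for value, run in groupby(labels)
                            if value == "1"]
    return lengths


# Tiny hypothetical file: two subjects, five and three seconds long.
example = "1,2\n0,1\n1,1\n1,0\n0,\n1,\n"
result = seizure_lengths(example)
```

A subject with no seizures maps to an empty list, which corresponds to the hyphen used in Table 7.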
In the second case one must be aware that our CNN has been trained on a specific dataset and is ready to recognize a certain type of seizures (i.e. neonatal ones). Therefore, it should not be expected that the pre-trained network will correctly classify completely different data (e.g. recordings from elderly patients). Some technical details of the EEG recordings must also be considered carefully. In our case, signals from 18 EEG channels connected according to the 'double banana' montage were fed to the CNN. When the new data are not analogous, the classification results can be very questionable. Nevertheless, when the new data are compatible (in the sense stated above), there are no major contraindications to feeding them to our pre-trained CNN. Some examples can be found in the Electronic Supplements. The Python code is quite universal, and the only requirement is to set a few variables in the Global variables block of the included Jupyter notebook. We also assume that the input data file names (given as HDF5 files) have the format expert_XXX_YYYsec_ZZZchunk_VVVHz.hdf5, where XXX is any string indicating, for example, the human expert who annotated the seizures, YYY is the window size in seconds, ZZZ is the number of contiguous chunks, and VVV is the base frequency in the HDF5 file. Data stored in HDF5 files must conform to the format shown in Fig. 9.
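The file naming convention above can be validated before training starts. The following sketch parses an input file name and derives the matching best-model file name for a given fold, following the example best_model_expert_A_1sec_1chunk_64Hz_fold_0.h5 quoted in Step 6. The helper names parse_input_filename and best_model_filename and the returned dictionary keys are illustrative assumptions, not functions from the authors' notebook.

```python
import re

# Pattern for expert_XXX_YYYsec_ZZZchunk_VVVHz.hdf5 as described in the text.
FILENAME_RE = re.compile(
    r"^expert_(?P<expert>.+)_(?P<window>\d+)sec_"
    r"(?P<chunks>\d+)chunk_(?P<freq>\d+)Hz\.hdf5$"
)


def parse_input_filename(name):
    """Split an input HDF5 file name into its four components."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected input file name: {name!r}")
    return {
        "expert": match["expert"],
        "window_sec": int(match["window"]),
        "chunks": int(match["chunks"]),
        "frequency_hz": int(match["freq"]),
    }


def best_model_filename(name, fold):
    """Build the best-model file name for a fold (folds start at 0)."""
    parse_input_filename(name)  # validate the input name first
    stem = name[:-len(".hdf5")]
    return f"best_model_{stem}_fold_{fold}.h5"


info = parse_input_filename("expert_A_1sec_1chunk_64Hz.hdf5")
model_name = best_model_filename("expert_A_1sec_1chunk_64Hz.hdf5", 0)
```

Rejecting malformed names early avoids discovering a mislabelled dataset only after a long training run.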

Data and code availability
All data generated and analysed during this study, as well as R and Python source codes, are included in Supplementary Information files.